This dataset is taken from https://www.fordgobike.com/system-data and covers trips made by members of the service in February 2019.
It contains information about each trip: the user type, the rider's birth year and gender, the start and end stations, the trip duration, and so on.
# Import all packages and configure plotly as the pandas plotting backend
import numpy as np
import pandas as pd
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
pd.options.plotting.backend = "plotly"
pd.options.mode.chained_assignment = None
pio.templates.default = 'plotly_dark'
Load the dataset and describe its properties through the questions below. I will also try to motivate my exploration goals.
df = pd.read_csv("201902-fordgobike-tripdata.csv")
df.head(8)
df.info()
df.describe()
Since I'm using the Ford GoBike data, I need to clean it before I start visualizing it.
# Make a copy of the dataframe to differentiate between the cleaned data and the original data
fgb_clean = df.copy() # fgb_clean = Ford GoBike cleaned data
fgb_clean.isna().sum()
fgb_clean = fgb_clean.dropna()
fgb_clean.info()
fgb_clean['duration_sec'] = fgb_clean['duration_sec'] / (60 * 60) # Convert seconds to hours (60 s * 60 min)
fgb_clean.rename(columns={'duration_sec': 'duration_hour'}, inplace=True)
fgb_clean['duration_hour']
fgb_clean["start_time"] = fgb_clean["start_time"].astype("datetime64[ns]")
fgb_clean["end_time"] = fgb_clean["end_time"].astype("datetime64[ns]")
fgb_clean.info()
# Remove unneeded columns from the dataset
fgb_clean = fgb_clean.drop(["start_station_id", "start_station_latitude", "start_station_longitude", "end_station_id",
"end_station_latitude", "end_station_longitude"], axis=1)
fgb_clean.info()
# Remove Other gender to keep only Male and Female
fgb_clean = fgb_clean[fgb_clean.member_gender != 'Other']
fgb_clean.info()
# Create a column that holds the day of the week
fgb_clean.insert(0,"Day", fgb_clean["start_time"].dt.day_name())
# Create TimeOfDay to show whether the start time is day or night (night = 17:00 to 05:59).
fgb_clean['TimeOfDay'] = 'Day'
start_hour = fgb_clean['start_time'].dt.hour  # hours run from 0 to 23
fgb_clean.loc[start_hour.between(17, 23) | start_hour.between(0, 5), 'TimeOfDay'] = 'Night'
fgb_clean['TimeOfDay']
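The Day/Night labelling above can also be written as a single vectorized expression. The following is a minimal sketch on a small made-up frame (`sample` and `is_night` are hypothetical names, and the timestamps are invented for illustration); it assumes the same cutoffs as the notebook, with hours 17 through 23 and 0 through 5 counted as night.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of start times (not taken from the dataset)
sample = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-02-01 04:30:00",
        "2019-02-01 08:15:00",
        "2019-02-01 16:59:00",
        "2019-02-01 17:00:00",
        "2019-02-01 23:45:00",
    ])
})

# Hours 17..23 and 0..5 are treated as night, everything else as day
hour = sample["start_time"].dt.hour
is_night = (hour >= 17) | (hour <= 5)

# np.where picks "Night" where the mask is True and "Day" elsewhere
sample["TimeOfDay"] = np.where(is_night, "Night", "Day")
print(sample)
```

Building the column in one step avoids assigning through a chained index, which pandas may silently apply to a temporary copy.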
After cleaning, 10 of the original columns remain, each describing one aspect of a trip: duration_hour, start_time, end_time, start_station_name, end_station_name, bike_id, user_type, member_birth_year, member_gender, and bike_share_for_all_trip. Two derived columns, Day and TimeOfDay, were added during cleaning.
1- Find the most common hour for a ride to start.
2- Show the most common end stations.
3- Show how trips are distributed across all dates using start_time.
4- Show the percentage of each member gender.
5- Show the percentage of each user type.
6- Calculate the maximum duration_hour and display its day.
7- Show the number of trips for each member_birth_year.
8- Show the maximum duration_hour for each member_birth_year.
9- Show the top 5 end stations by day of week, for females only.
10- Show the top 5 end stations by TimeOfDay, for customers only.
11- Scatter the bike sharing of males and females on Tuesday.
After cleaning the data and dropping unnecessary columns, I'll use almost all of the remaining columns to investigate my features of interest.
fgb_clean["start_time"].dt.hour.hist(legend = True)
As seen in the graph, the most common hour for a ride to start is 17 (5 PM), followed by 8 (8 AM).
Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
I needed to convert the start_time column from object to datetime64[ns]. Then I used fgb_clean["start_time"].dt.hour so that the x axis shows the hour of the day instead of the full timestamp.
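To make the conversion step concrete, here is a minimal sketch of parsing string timestamps and counting trips per start hour. The frame `raw` and its three timestamps are made up for illustration, and `pd.to_datetime` is used here as one common way to do the conversion rather than the exact call used in the notebook.

```python
import pandas as pd

# Hypothetical raw timestamps stored as strings, as read_csv would return them
raw = pd.DataFrame({"start_time": [
    "2019-02-28 17:32:10",
    "2019-02-28 17:45:00",
    "2019-02-28 08:05:00",
]})

# Parse the strings into a datetime column so the .dt accessor is available
raw["start_time"] = pd.to_datetime(raw["start_time"])

# Count trips per hour of day, ordered by hour
trips_per_hour = raw["start_time"].dt.hour.value_counts().sort_index()

# The hour with the largest count
busiest_hour = trips_per_hour.idxmax()
print(trips_per_hour)
print("Busiest hour:", busiest_hour)
```

Applied to the cleaned data, the same `value_counts` call would give exact counts to compare against the histogram bins.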
Now I will plot the top 5 end stations in the dataset.
counter = fgb_clean['end_station_name'].value_counts()
top_stations = fgb_clean.loc[fgb_clean['end_station_name'].isin(counter.index[0:5])]
fig = top_stations['end_station_name'].value_counts().plot(kind='barh');
fig.update_layout(
title="Top 5 end stations",
yaxis_title="(End station names)",
xaxis_title="(Value)",
legend_title="variable",
font=dict(
family="Courier New, monospace",
color="lightblue"
)
)
fig.update_traces(marker_color='grey')
fig.show()
Let's look at how start times are distributed across all dates in the dataset.
fig = fgb_clean["start_time"].dt.date.plot(kind = 'hist')
fig.update_layout(
title="Start time of the entire date",
xaxis_title="Date",
legend_title="variable",
font=dict(
family="Courier New, monospace",
)
)
fig.update_traces(marker_color='brown')
fig.show()
Another visualization, this time of the month in which each trip started:
fgb_clean['start_time'].dt.month_name().hist()
Only February appears because, as mentioned in the introduction, this dataset covers a single month, unlike other datasets that span an entire year.
labels = list(fgb_clean['member_gender'].value_counts().index)
values = list(fgb_clean['member_gender'].value_counts().values)
fig = go.Figure(data=[go.Pie(labels=labels, values=values, pull=[0, 0.2])]) # pull means pull Female from pie chart by 0.2
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(title = 'Male Vs Female')
fig.show()
Males account for 76.2% of the trips in the cleaned data, while females account for only 23.8%.
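The percentages read from the pie chart can also be computed directly. The sketch below uses a tiny made-up Series (`genders` is a hypothetical name, and its four values are not from the dataset) to show how `value_counts(normalize=True)` turns counts into shares.

```python
import pandas as pd

# Hypothetical gender column with three males and one female
genders = pd.Series(["Male", "Male", "Male", "Female"], name="member_gender")

# normalize=True returns fractions; multiply by 100 for percentages
pct = genders.value_counts(normalize=True).mul(100).round(1)
print(pct)
```

Running the same expression on `fgb_clean['member_gender']` should give the figures shown in the chart labels.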
labels = list(fgb_clean['user_type'].value_counts().index)
values = list(fgb_clean['user_type'].value_counts().values)
fig = go.Figure(data=[go.Pie(labels=labels, values=values, pull=[0, 0])]) # without pull
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(title = 'Subscribers Vs Customers')
fig.show()
fig = fgb_clean.plot(kind='scatter', x = 'Day', y = 'duration_hour');
fig.update_layout(
title="Day / Duration hours",
xaxis_title="Day",
yaxis_title="duration_hour",
font=dict(
family="Courier New, monospace",
color="lightblue"
)
)
fig.update_traces(marker_color='lightyellow')
fig.show()
The graph shows that Saturday has the highest point; hovering over that point shows duration_hour = 23.485556.
Now let's check whether that is correct.
max_row = fgb_clean['duration_hour'].idxmax() # get the index of maximum duration hour
fgb_clean.loc[max_row] # locate the row with the maximum duration_hour
Yes, the maximum duration is 23.485556 hours, and the day is Saturday.
Talk about some of the relationships you observed in this part of the investigation. How did the features of interest vary with other features in the dataset?
I inserted a new column called "Day", which holds the day of the week (Sunday, Monday ... Saturday). I then used a scatter plot (with the plotly library) to display duration_hour for every day of the week. In the graph, the maximum duration is clearly visible as the highest point.
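The relationship between day of week and trip duration can also be summarized numerically. This is a minimal sketch on invented data (`trips` and `max_per_day` are hypothetical names, and the four rows are not from the dataset) that derives the day name and takes the longest trip per day.

```python
import pandas as pd

# Hypothetical trips: 2019-02-01 is a Friday, 02-02 a Saturday, 02-03 a Sunday
trips = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-02-01 09:00",
        "2019-02-02 10:00",
        "2019-02-02 18:00",
        "2019-02-03 12:00",
    ]),
    "duration_hour": [0.25, 1.5, 3.0, 0.5],
})

# Derive the day name, then take the longest trip for each day
trips["Day"] = trips["start_time"].dt.day_name()
max_per_day = (
    trips.groupby("Day")["duration_hour"]
    .max()
    .sort_values(ascending=False)
)
print(max_per_day)
```

A table like this gives the per-day maximum without having to locate the highest point in the scatter plot by eye.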
# Note: This is Bivariate because I used two variables 'member_birth_year' and 'duration_hour' with an aggregate function
fig = fgb_clean.groupby('member_birth_year')['duration_hour'].agg(['count']).plot(kind = 'line', y = 'count')
fig.update_layout(
title="The count of member birth date",
font=dict(
family="Courier New, monospace",
)
)
fig.show()
The graph shows that the trip count peaks for birth years between 1980 and 1995. It then drops sharply approaching 2000, because very few riders were born around 2000.
fig = fgb_clean.groupby('member_birth_year')['duration_hour'].agg(['max']).plot(kind = 'scatter', y = 'max')
fig.update_layout(
title="The maximum duration hour by member birth date",
font=dict(
family="Courier New, monospace",
)
)
fig.show()
female = top_stations.loc[top_stations['member_gender'] == 'Female']
female.info()
pio.templates.default = 'plotly'
fig = female.plot(kind='barh', y = 'end_station_name', color = 'Day'); # transparent visualization
fig.update_layout(
title="Top 5 end stations with Females",
yaxis_title="(End station names)",
xaxis_title="(Value)",
legend_title="variable",
font=dict(
family="Courier New, monospace",
color="blue",
)
)
fig.show()
Among females, the end station 'San Francisco Caltrain Station 2 (Townsend St at 4th St)' has the most trips, with Tuesday being the most common day. Saturday and Sunday are barely visible in the visualization because they fall on the weekend.
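A contingency table is one way to read the same station-by-day breakdown as numbers. The sketch below uses made-up rows and placeholder station names ("Station A" and "Station B" are hypothetical, as are the names `rides` and `table`) and relies on `pd.crosstab` to count trips per station and day.

```python
import pandas as pd

# Hypothetical rides with placeholder station names
rides = pd.DataFrame({
    "end_station_name": ["Station A", "Station A", "Station B", "Station A", "Station B"],
    "Day": ["Tuesday", "Tuesday", "Saturday", "Saturday", "Tuesday"],
    "member_gender": ["Female", "Female", "Female", "Male", "Female"],
})

# Keep only female riders, mirroring the filter used for the chart
female = rides[rides["member_gender"] == "Female"]

# Rows are end stations, columns are days, cells are trip counts
table = pd.crosstab(female["end_station_name"], female["Day"])
print(table)
```

Each cell corresponds to one colored segment of the stacked bar chart, so small weekend segments show up as small numbers.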
cust = top_stations.loc[top_stations['user_type'] == 'Customer']
cust.info()
pio.templates.default = 'plotly_dark'
fig = cust.plot(kind='barh', y = 'end_station_name', color = 'TimeOfDay'); # transparent visualization
fig.update_layout(
title="Top 5 end stations by TimeOfDay with Customers",
yaxis_title="(End station names)",
xaxis_title="(Value)",
legend_title="variable",
font=dict(
family="Courier New, monospace",
color="lightblue"
)
)
fig.show()
Among customers, the end station 'San Francisco Ferry Building (Harry Bridges Plaza)' has the most trips, with Day being more common than Night.
Were there any interesting or surprising interactions between features?
Yes, I was surprised to see that Day is more common than Night for customers; I had expected the opposite.
tuesday = fgb_clean.loc[fgb_clean['Day'] == 'Tuesday']
tuesday.info()
fig2 = tuesday.plot(kind='scatter', x = 'bike_share_for_all_trip', y = 'duration_hour', color = 'member_gender');
fig2.update_layout(
title="Bike sharing of Males and Females during Tuesday", # depending on duration_hour
yaxis_title="(Duration hour)",
xaxis_title="(Bike sharing)",
legend_title="variable",
plot_bgcolor='rgb(10,10,10)',
font=dict(
family="Courier New, monospace",
color="lightblue"
)
)
fig2.show()
Bike share for all trips are very rare among both males and females on Tuesday.
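How rare these trips are can be quantified per gender. The following sketch uses a small invented frame (its six rows are not from the dataset, and `share` is a hypothetical name) and assumes the column holds "Yes"/"No" values; `normalize="index"` makes each row of the table sum to 1.

```python
import pandas as pd

# Hypothetical Tuesday trips; the Yes/No values are an assumption
tuesday = pd.DataFrame({
    "member_gender": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "bike_share_for_all_trip": ["No", "No", "No", "Yes", "No", "Yes"],
})

# Fraction of Yes/No trips within each gender (each row sums to 1)
share = pd.crosstab(
    tuesday["member_gender"],
    tuesday["bike_share_for_all_trip"],
    normalize="index",
)
print(share)
```

On the real data, a small value in the "Yes" column for both genders would back up the reading of the scatter plot.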
Save the gathered, assessed, and cleaned dataset to a CSV file named "fgb_cleaned.csv".
fgb_clean.to_csv('fgb_cleaned.csv', index = False)